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Abstract 

Maximum likelihood estimators are often of limited practical use due to the intensive computation 
they require. We propose a family of alternative estimators that maximize a stochastic variation of the 
composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differ- 
ently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. 
We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical 
robustness, and computational complexity. We discuss experimental results in the context of Boltzmann 
machines and conditional random fields. The theoretical and experimental studies demonstrate the ef- 
fectiveness of the estimators when the computational resources are insufficient. They also demonstrate 
that in some cases reduced computational complexity is associated with robustness thereby increasing 
statistical accuracy. 
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1 Introduction 

Maximum likelihood estimation is by far the most popular point estimation technique in machine learning 
and statistics. Assuming that the data consists of n, m-dimensional vectors 

D A^'^eR", (1) 

and is sampled iid from a parametric distribution p^o with 9q Cz O d W , a maximum likelihood estimator 
(mle) 0™' is a maximizer of the loglikelihood function 

n 

=^logpfl(A«) (2) 

argmaxC(6';L'). (3) 

eee 

The use of the mle is motivated by its consistencj0, i.e. ^Jf' — > 0o as n — > oo with probability 1 [6 . The 
consistency property ensures that as the number n of samples grows, the estimator will converge to the true 
parameter 0q governing the data generation process. 

*To whom correspondence should be addressed. Email: jvdillonSgatech.edu' 

^The consistency S™' — > 9o with probability 1 is sometimes called strong consistency in order to difEcrentiate it from the 
weaker notion of consistency in probability P(|S™' — do\ < e) —> 0. 



An even stronger motivation for the use of the mle is that it has an asymptotically normal distribution 
with mean vector and variance matrix {nI{9o))~^ . More formally, we have the following convergence in 
distribution as n — )■ oo [B] 

V^{df^0o)^N{o,r\eo)), (4) 

where I (9) is the r x r Fisher information matrix 

I{e) = Ep,{Vlogp9(X)(Vlogp9(X))^} (5) 

with V/ representing the r x 1 gradient vector of f{9) with respect to 6. The convergence (HJ is especially 
striking since according to the Cramer-Rao lower bound, the asymptotic variance {nI{9o))~^ of the mle is 
the smallest possible variance for any estimator. Since it achieves the lowest possible asymptotic variance, 
the mle (and other estimators which share this property) is said to be asymptotically efficient. 

The consistency and asymptotic efficiency of the mle motivate its use in many circumstances. Unfortu- 
nately, in some situations the maximization or even evaluation of the loglikelihood ^ and its derivatives is 
impossible due to computational considerations. For instance this is the situation in many high dimensional 
exponential family distributions, including Markov random fields whose graphical structure contains cycles. 
This has lead to the proposal of alternative estimators under the premise that a loss of asymptotic efficiency 
is acceptable-in return for reduced computational complexity. 

In contrast to asymptotic efficiency, we view consistency as a less negotiable property and prefer to avoid 
inconsistent estimators if at all possible. This common viewpoint in statistics is somewhat at odds with 
recent advances in the machine learning literature promoting non-consistent estimators, for example using 
variational techniques [9]. Nevertheless, we feel that there is a consensus regarding the benefits of having 
consistent estimators over non-consistent ones. 

In this paper, we propose a family of estimators, for use in situations where the computation of the mle is 
intractable. In contrast to many previously proposed approximate estimators, our estimators are statistically 
consistent and admit a precise quantification of both computational complexity and statistical accuracy 
through their asymptotic variance. Due to the continuous parameterization of the estimator family, we obtain 
an effective framework for achieving a predefined problem-specific balance between computational tractability 
and statistical accuracy. We also demonstrate that in some cases reduced computational complexity may in 
fact act as a regularizer, increasing robustness and therefore accomplishing both reduced computation and 
increased accuracy. This "win-win" situation confiicts with the conventional wisdom stating that moving 
from the mle to pseudo-likelihood and other related estimators result in a computational win but a statistical 
loss. Nevertheless we show that this occurs in some practical situations. 

For the sake of concreteness, we focus on the case of estimating the parameters associated with Markov 
random fields. In this case, we provide a detailed discussion of the accuracy-complexity tradeoff. We include 
experiments on both simulated and real world data for several models including the Boltzmann machine, 
conditional random fields, and the Boltzmann linear chain model. 

2 Related Work 

There is a large body of work dedicated to tractable learning techniques. Two popular categories are 
Markov chain Monte Carlo (MCMC) and variational methods. MCMC is a general purpose technique for 
approximating expectations and can be used to approximate the normalization term and other intractable 
portions of the loglikelihood and its gradient [1] . Variational methods are techniques for conducting inference 
and learning based on tractable bounds [9]. 

Despite the substantial work on MCMC and variational methods, there are little practical results concern- 
ing the convergence and approximation rate of the resulting parameter estimators. Variational techniques are 
sometimes inconsistent and it is hard to analyze their asymptotic statistical behavior. In the case of MCMC, 
a number of asymptotic results exist [4 , but since MCMC plays a role inside each gradient descent or EM 
iteration it is hard to analyze the asymptotic behavior of the resulting parameter estimates. An advantage 
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of our framework is that we are able to directly characterize the asymptotic behavior of the estimator and 
relate it to the amount of computational savings. 

Our work draws on the composite likelihood method for parameter estimation proposed by (13j which in 
turn generalized the pseudo likelihood of 2 . A selection of more recent studies on pseudo and composite 
likelihood are pQ [11] [^Hl [ISl [Z] • Most of the recent studies in this area examine the behavior of the pseudo 
or composite likelihood in a particular modeling situation. We believe that the present paper is the first to 
systematically examine statistical and computational tradeoffs in a general quantitative framework. Possible 
exceptions are |22j which is an experimental study on texture generation, [211 which is focused on inference 
rather than parameter estimation, and |12] which compares discriminative and generative risks. 

3 Stochastic Composite Likelihood 

In many cases, the absence of a closed form expression for the normalization term prevents the computation 
of the loglikelihood ([2]) and its derivatives thereby severely limiting the use of the mle. A popular example is 
Markov random fields, wherein the computation of the normalization term is often intractable (see Section [5] 
for more details). In this paper we propose alternative estimators based on the maximization of a stochastic 
variation of the composite likelihood. 

(r) 

We denote multiple samples using superscripts and individual dimensions using subscripts. Thus 
refers to the j-dimension of the r sample. Following standard convention we refer to random variables (RV) 
using uppercase letters and their corresponding values using lowercase letters. We also use the standard 
notations for extracting a subset of the dimensions of a random variable 

Xs {Xr.ie S}, X_, {X, : ^ ^ j}. (6) 

We start by reviewing the pseudo loglikelihood function [2^ associated with the data D ([T]), 

n m 

pi4B;D) ^^logp,(xf|X«). (7) 

The maximum pseudo likelihood estimator (mple) is consistent i.e., 6'™p' — >■ 9q with probability 1, but 
possesses considerably higher asymptotic variance than the mle's {nI{6o))~^ . Its main advantage is that it 
does not require the computation of the normalization term as it cancels out in the probability ratio defining 
conditional distributions 

PeiX,\X^,)^MX,\{X, : k ^ j}) ^ "'f ^-. x 

The mle and mple represent two different ways of resolving the tradeoff between asymptotic variance 
and computational complexity. The mle has low asymptotic variance but high computational complexity 
while the mple has higher asymptotic variance but low computational complexity. It is desirable to obtain 
additional estimators realizing alternative resolutions of the accuracy complexity tradeoff. To this end we 
define the stochastic composite likelihood whose maximization provides a family of consistent estimators 
with statistical accuracy and computational complexity spanning the entire accuracy-complexity spectrum. 

Stochastic composite likelihood generalizes the likelihood and pseudo likelihood functions by constructing 
an objective function that is a stochastic sum of likelihood objects. We start by defining the notion of m-pairs 
and likelihood objects and then proceed to stochastic composite likelihood. 

Definition 1. An m-pair {A, B) is a pair of sets A,Bc {1, . . . , m} satisfying A 7^ = AOB. The likelihood 
object associated with an m-pair {A,B) and X is Sg{A,B) = logpe{XA\XB) where Xs is defined in ([6]). 
The composite loglikelihood function [13] is a collection of likelihood objects defined by a finite sequence of 
m-pairs {Ai,Bi), . . . , {Ak,Bk) 

n k 

cUe-D) ^5]logp,(X«|X«). (9) 

i=l j=l 
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There is a certain lack of flexibility associated with the composite likelihood framework as each likelihood 
object is either selected or not for the entire sample X^^\ . . . ,X^'^\ There is no allowance for some objects 
to be selected more frequently than others. For example, available computational resources may allow the 
computation of the loglikelihood for 20% of the samples, and the pseudo-likelihood for the remaining 80%. 
In the case of composite likelihood if we select the full-likelihood component (or the pseudo-likelihood or any 
other likelihood object) then this component is applied to all samples indiscriminately. 

In SCL, different likelihood objects Sg{Aj,Bj) may be selected for different samples with the possibility 
of some likelihood objects being selected for only a small fraction of the data samples. The selection may be 
non-coordinated, in which case each component is selected or not independently of the other components. 
Or it may be coordinated in which case the selection of one component depends on the selection of the other 
ones. For example, we may wish to avoid selecting a pseudo likelihood component for a certain sample X*^*^ 
if the full likelihood component was already selected for it. 

Another important advantage of stochastic selection is that the discrete parameterization of ^ defined 
by the sequence {Ai,Bi), . . . , {Ak, Bk) is less convenient for theoretical analysis. Each component is either 
selected or not, turning the problem of optimally selecting components into a hard combinatorial problem. 
The stochastic composite likelihood, which is defined below, enjoys continuous parameterization leading to 
more convenient optimization techniques and convergence analysis. 

Definition 2. Consider a finite sequence of m-pairs (Ai, _Bi), . . . , (Afc, Bfe), a dataset D = (X*^^), . . . , X^"^), 
P e and m iid binary random vectors Z^^\ . . . , Z^™) ^ P{Z) with \j = E (Zj ) > 0. The stochastic 
composite loglikelihood (scl) is 

1 " 

scC(6';i:') = - Vme(xW,zW), where (10) 
n ^-^ 

k 

mg{X, Z) Po^o logPe(XA, \Xb,). (11) 

i=i 

In other words, the scl is a stochastic extension of © where for each sample X^'^\i = l,...,n, the 
likelihood objects S{Ai,Bi), . . . , S{Ak,Bk) are either selected or not, depending on the values of the binary 

(i) (i) (i) 

random variables , . . . , Zm and weighted by the constants /3i , . . . , . Note that may in general 

depend on Zr^^ but not on Zr^ or on X^^\ 

When we focus on examining different models for P{Z) we sometimes parameterize it, for example by A 
i.e., P\{Z). This reuse of A (it is also used in Definition [2]) is a notational abuse. We accept it, however, as 
in most of the cases that we consider Ai, . . . , A^: from Definition [2] cither form the parameter vector for P{Z) 
or are part of it. 

Some illustrative examples follow. 

Independence. Factorizing P\{Zi, . . . , Zk) = Yij P>^i i^j) leads to zj*-* ^ Ber(Aj) with complete indepen- 
dence among the indicator variables. For each sample X^'^\ each likelihood object S{Aj,Bj) is selected 
or not independently with probability Xj. 

Multinomial. A multinomial model Z ^ A/Iult(l,A) implies that for each sample Z^^^ a multivariate 
Bernoulli experiment is conducted with precisely one likelihood object being selected depending on 
the selection probabilities Ai, . . . , A^. 

Product of Multinomials. A product of multinomials is formed by a partition of the dimensions to I 
disjoint subsets {1, . . . , to} = C\ U ■ ■ ■ Ci where Zd ~ Mult(l, (A^ : j e Ci)) i.e., 

c 

P{Z) =l[P, {{Zj : J e a}) , where is Muh(l, (A, : j e Q)). 

i=l 
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Loglinear Models. The distribution P{Z) follows a hierarchical loglinear model [3]. This case subsumes 
the other cases above. 

In analogy to the mle and the mple, the maximum scl estimator (mscle) 6™^^ estimates by maximizing 
the scl function. In contrast to the loglikelihood and pseudo loglikelihood functions, the scl function and its 
majcimizer are random variables that depend on the indicator variables Z^'^\ . . . , Z^"^ in addition to the data 
D. As such, its behavior should be summarized by examining the limit n — >■ oo. Doing so eliminates the 
dependency on particular realizations of . . . , Z^"^ in favor of the the expected frequencies \j = E p(^z)Zj 
which are non-random constants. 

The statistical accuracy and computational complexity of the msl estimator are continuous functions of 
the parameters A) (components weights and selection probabilities respectively) which vary continuously 
throughout their domain (A, /3) e A x K';^. Choosing appropriate values of (A, /3) retrieves the special 
cases of mle, mple, maximum composite likelihood with each selection being associated with a distinct 
statistical accuracy and computational complexity. The scl framework allows selections of many more values 
of (A, /3) realizing a wide continuous spectrum of estimators, each resolving the accuracy-complexity tradeoff 
differently. 

We include below a demonstration of the scl framework in a simple low dimensional case. In the following 
sections we discuss in detail the statistical behavior of the mscle and its computational complexity. We 
conclude the paper with several experimental studies. 



3.1 Boltzmann Machine Example 

Before proceeding we illustrate the SCL framework using a simple example involving a Boltzmann machine 
[S]. We consider in detail three SCL policies: full likelihood (FL), pseudo-likelihood (PL), and a stochastic 
combination of first and second order pseudo-likelihood with the first order components {p{Xi\X_i)) selected 
with probability A and the second order components {p{Xi, Xj\X^^ jy<:)) with probability 1 — A. 

Denoting the number of (binary) graph nodes by m, the number of examples by n, the computational com- 
plexity of the FL function (FLOfH counts) is O ^ (2™ + n)^ (loglikelihood) and O 

(loglikelihood gradient). The exponential growth in m prevents such computations for large graphs. 

The fc-order PL function offers a practical alternative to FL (1-order PL correspond to the traditional 
pseudo-likelihood and 2-order is its analog with second order components p{X^^ jy |A{j j-i,c)). The complexity 

TO \ / / to\ „j. 




of computing the corresponding SCL function isOll^jM^jS+rt)) (for the objective function) 



TO 
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and O I ( '^'^ ) ( ) 2'' + n [ ) ) (for the gradient). The slower complexity growth of the /c-order PL 

(polynomial in m instead of exponential) is offset by its reduced statistical accuracy, which we measure using 
the normalized asymptotic variance 

eff(4) = det(Asymp Yarji,)) 
det(Asymp Var(6'™i<=)) 

which is bounded from below by 1 (due to Cramer Rao lower bound) and its deviation from 1 reflects its 
inefficiency relative to the MLE. 

The MLE thus achieves the best accuracy but it is computationally intractable. The first order and 
second order PL have higher asymptotic variance but are easier to compute. The SCL framework enables 
adding many more estimators filling in the gaps between ML, 1-order PL, 2-order PL, etc. 

We illustrate three SCL functions in the context of a simple Boltzmann machine (five binary nodes, 
fourteen samples A(i), ... ,A(i4), 6'*™'= = (-1, -1, -1, -1, -1, 1, 1, 1, 1, 1)) in Figure □ The top box refers 
to the full likelihood policy. For each of the fourteen samples, the FL component is computed and their 



^FLOP stands for the number of floating point operations. 
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aggregation forms the SCL function which in this case equals the loghkehhood. The selection of the FL 
component for each sample is illustrated using a diamond box. The numbers under the boxes reflect the 
FLOP counts needed to compute the components and the total complexity associated with computing the 
entire SCL or loglikelihood is listed on the right. As mentioned above, the normalized asymptotic variance 
(Ha) is 1. 

The pseudo-likelihood function ([7]) is illustrated in the second box where each row correspond to one 
of the five PL components. As each of the five PL component is selected for each of the samples we have 
diamond boxes covering the entire 5 x 14 array. The shade of the diamond boxes reflects the complexity 
required to compute them enabling an easy comparison to the FL components in the top of the figure (note 
how the FL boxes are much darker than the PL boxes). The numbers at the bottom of each column reflect 
the FLOP marginal count for each of the fourteen samples and the numbers to the right of the rows reflect 
the FLOP marginal count for each of the PL components. In this case the FLOP count is less than half 
the FLOP count of the FL in top box (this reduction in complexity obtained by replacing FL with PL will 
increase dramatically for graphs with more than 5 nodes) but the asymptotic variance is 83% higheJl. 

The third SCL policy reflects a stochastic combination of flrst and second order pseudo likelihood compo- 
nents. Each first order component is selected with probability A and each second order component is selected 
with probability 1 — A. The result is a collection of 5 1-order PL components and 10 2-order components with 
only some of them selected for each of the fourteen samples. Again diamond boxes correspond to selected 
components which are shaded according to their FLOP complexity. The per-component FLOP marginals 
and per example FLOP marginals are listed as the bottom row and right-most column. The total complexity 
is somewhere between FL and PL and the asymptotic variance is reduced from the PL's 183% to 148%. 

Additional insight may be gained at this point by considering Figure [3] which plots several SCL estima- 
tors as points in the plane whose x and y coordinates correspond to normalized asymptotic variance and 
computational complexity respectively. We turn at this point to considering the statistical properties of the 
SCL estimators. 

4 Consistency and Asymptotic Variance of 6^^^ 

A nice property of the SCL framework is enabling mathematical characterization of the statistical properties 
of the estimator 0™^^. In this section we examine the conditions for consistency of the mscle and its asymp- 
totic distribution and in the next section we consider robustness. The propositions below constitute novel 
generalizations of some well-known results in classical statistics. Proofs may be found in Appendix [X] For 
simplicity, we assume that X is discrete and pg{x) > 0. 

Definition 3. A sequence of m-pairs (Ai, Bi), . . . , {A^, B^) ensures identiflability oipe if the map {pe{XAj {Xb^ ) : 
J = l,...,fc} peiX) is injective. In other words, there exists only a single collection of conditionals 
{pg{XAj \XBj) : J = 1, • • • , fc} that does not contradict the joint pg{X). 

Proposition 1 . Let Q C MJ" be an open set, pg{x) > and continuous and smooth in 6, and (Ai, Bi), . . . , ( , B^ ) 
be a sequence of m-pairs for which {{Aj, Bj) : Vj such that Xj > 0} ensures identiflability. Then the sequence 
of SCL maximizers is strongly consistent i.e., 



The above proposition indicates that to guarantee consistency, the sequence of m-pairs needs to satisfy 
Definition [31 It can be shown that a selection equivalent to the pseudo likelihood function, i.e.. 



ensure identiflability and consequently the consistency of the mscle estimator. Furthermore, every selection 
of TO-pairs that subsumes S in (1141) similarly guarantees identiflability and consistency. 

^The asymptotic variance of SCL functions is computed using formulas derived in the next section 




(13) 



5 = {(Ai,Bi),...,(A™,B„)} where A, = {t}, B, ^ {1, . . . ,m} \ A, 



(14) 
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Figure 1: Sample runs of three different SCL policies for 14 examples X^^\ . . . , X'^*' drawn 
from a 5 binary node Boltzmann machine (g*''"'' — (— 1, — 1, — 1, — 1, — 1, 1, 1, 1, 1, 1)). The 
policies are full likelihood (FL, top), pseudo-likelihood (PL, middle), and a stochastic com- 
bination of first and second order pseudo-likelihood with the first order components selected 
with probability 0.7 and the second order components with probability 0.3 (bottom). 

The sample runs for the policies are illustrated by placing a diamond box in table entries corre- 
sponding to selected likelihood objects (rows corresponding to likelihood objects and columns 
to X^^\ . . . , X^^*'). The FLOP counts of each likelihood object determines the shade of the 
diamond boxes while the total FLOP counts per example and per likelihood objects are dis- 
played as table marginals (bottom row and right column for each policy). We also display the 
total FLOP count and the normalized asymptotic variance l|12|) . 

Even in the simple case of 5 nodes, FL is the most complex policy with PL requiring a third of 
the FL computation. 0.7PL-I-0.3PL2 is somewhere in between. The situation is reversed for 
the estimation accuracy-FL achieves the lowest possible normalized asymptotic variance of 1, 
PL is almost twice that, and 0.7PL-I-0.3PL2 somewhere in the middle. The SCL framework 
spans the accuracy-complexity spectrum. Choosing the right A value obtains an estimator 
that is suits available computational resources and required accuracy. 
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The proposition below establishes the asymptotic normality of the mscle The asymptotic variance 
enables the comparison of scl functions with different parameterizations (A,/3). 

Proposition 2. Making the assumptions of Proposition [7] as well as convexity of Q d W we have the 
following convergence in distribution 

V^iOr^ -9o)^N (0, TST) (15) 

where 

k 

= E ^J-^^- Vare„{^SeM,,B,)) (16) 

E = Vare, fi, A, V5,„ {A, , B, ) j . (17) 

The notation Vare„(y) represents the covariance matrix of the random vector Y under pg^ while the 
notations A , in the proof below denote convergences in probability and in distribution [6] . V represents 
the gradient vector with respect to 9. 

When is a vector the asymptotic variance is a matrix. To facilitate comparison between different 
estimators we follow the convention of using the determinant, and in some cases the trace, to measure the 
statistical accuracy. See [16] for some heuristic arguments for doing so. Figures fl I2I3I provide the asymptotic 
variance for some SCL estimators and describe how it can be used to gain insight into which estimator to 
use. 

The statistical accuracy of the SCL estimator depends on /? (weight parameters) and A (selection pa- 
rameter). It is thus desirable to use the results in this section in determining what values of /?, A to use. 
Directly using the asymptotic variance is not possible in practice as it depends on the unknown quantity 
Oq. However, it is possible to estimate the asymptotic variance using the training data. We describe this in 
Section [71 



5 Robustness of 9^^^ 

We observed in our experiments (see Section [8]) that the SCL estimator sometimes performs better on a held- 
out test set than did the maximum likelihood estimator. This phenomenon seems to be in contradiction to 
the fact that the asymptotic variance of the MLE is lower than that of the SCL maximizer. This is explained 
by the fact that in some cases the true model generating the data does not lie within the parametric family 
{pe ■ G 0} under consideration. For example, many graphical models (HMM, CRF, LDA, etc.) make 
conditional independence assumptions that are often violated in practice. In such cases the SCL estimator 
acts as a regularizer achieving better test set performance than the non-regularized MLE. We provide below 
a theoretical account of this phenomenon using the language of m-estimators and statistical robustness. Our 
notation follows the one in [19]. 

We assume that the model generating the data is outside the model family P{X) <^ {pg : 6 E Q} and we 
augment mg {X, Z) in ([TT]) with 

M^, Z) "^VmgiX, Z) 

ipg{X, Z) = V^me(X, Z) (matrix of second order derivatives) 

n 

*«(e) = - Vv^e(X«,Z«). 
n ^ — ' 
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Proposition [3] below generalizes the consistency result by asserting that d„ — > where Oq is the point on 
{pe : 9 £ <d} that is closest to the true model P, as defined by 



^0 = argmaxM(0) where M(^^) = - ^ /3, A,-D(P(Xa, [^^(X^JXbJ), (18) 



fc 

xM{e) where M{e) = - ' 
or equivalently, satisfies 

Ep(x)Ep(z)V^eo(^,^)=0. (19) 

When the scl function reverts to the loglikelihood function, Oq becomes the KL projection of the true model 
P onto the parametric family {pg : 6 G Q}. 

Proposition 3. Assuming the conditions in Provosition [7] as well as supg.||g_g ii^^ M{9) < M{9o) for all 
e > we have 6™"'' Oq as n ^ oo with probability 1. 

The added condition maintains that Oq is a well separated maximum point of M. In other words it asserts 
that only values close to 9q may yield a value of M that is close to the maximum AI{6o). This condition is 
satisfied in the case of most exponential family models. 

Proposition 4. Assuming the conditions of Proposition \^ as well as Ep(x)f p(z)||^eo(^7 ■2^)11'^ < cx3, 
E pi^x)^ Pi^z)4'9o{'^) exists and is non-singular, l^^ijl = \d'^ipe{x)/d9i0j\ < g(x) for all i,j and in a neigh- 
borhood of 9o for some integrable g, we have 

n 

MOn - Oo) - -iEp^x)Epiz)i^6„r'-^Y. V'e„(^(*\ ^«) + op(l) (20) 

i—l 

or equivalently 



1 " /I 



(21) 



Above, /„ — op{gn) means fn/gn converges to with probability 1. 

Corollary 1. Assuming the conditions specified in Proposition^ we have 

V^{On - Oo) - A^(0, {Ep^x)Epiz)i'eoyHEp^x)Ep^z)i'eoyji){Epix)Epiz)M-^). (22) 

Equation (PT|) means that asymptotically, 9n behaves as 6*0 plus the average of iid RVs. As mentioned 
in |19| this fact may be used to obtain a convenient expression for the asymptotic influence function, which 
measures the effect of adding a new observation to an existing large dataset. Neglecting the remainder in 
(PH)) we have 

I{x, z) = L{X^^\ a;, . . . , Z^"-^\ z) - 9n-i{X^^\ . . . , . . . , z("-i)) 

/ n-l n-1 \ 

« -(E p(x) E P(z) V-.J-i - E ' ^^'^) + -^«o ^) T E (^^^^ ' ^^'^) 

\ n ^-^ n n — 1 ^-^ I 

\ i=l i=l / 

1 1 ""^ 

= -(Ep(;,)Ep(^)^)-i-^,„(u;,z) + (Ep(x)Ep(^)V'9j-i^— -^Vflo(^'^\^^^') 

^ ' i=l 

= --(Ep(x)Ep(z)^eo)~Veo(w,2;) + op ( - ) . (23) 



Corollary [T] and Equation [53] measure the statistical behavior of the estimator when the true distribution 
is outside the model family. In these cases it is possible that a computationally efficient SCL maximizer will 
result in higher statistical accuracy as well. This "win-win" situation of improving in both accuracy and 
complexity over the MLE is confirmed by our experiments in Section |8l 
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6 Stochastic Composite Likelihood for Markov Random Fields 



Markov random fields (MRF) are some of tire more popular statistical models for complex high dimensional 
data. Approaches based on pseudo likelihood and composite likelihood are naturally well-suited in this case 
due to the cancellation of the normalization term in the probability ratios defining conditional distributions. 
More specifically, a MRF with respect to a graph G ~ {V, E), V = {1, . . . , m} with a clique set C is given 
by the following exponential family model 

Peix) = cxp [ ^ 9cfc{xc) - logZ(0) J , 

Z(^^) = ^exp(^0,/c(a:c)) • (24) 

The primary bottlenecks in obtaining the maximum likelihood are the computations log Z{ff) and V log Z{9). 
Their computational complexity is exponential in the graph's treewidth and for many cyclic graphs, such as 
the Ising model or the Boltzmann machine, it is exponential in \V\ = m. 

In contrast, the conditional distributions that form the composite likelihood of ([M)) are given by (note 
the cancellation of Z(9)) 



Pe{xA\xB) = 7 (25) 



E Eexp ( E ^c/c((a:A'2;B,a;'(^us)"=)c) 

whose computation is substantially faster. Specifically, The computation of ([25]) depends on the size of 
the sets A and (A U By and their intersections with the cliques in C. In general, selecting small \Aj\ and 
Bj = {AjY leads to efficient computation of the composite likelihood and its gradient. For example, in 
the case of \Aj\ — I, \Bj\ = in — I with Z <C m we have that k < m!/(/!(m — and the complexity of 
computing the ci{9) function and its gradient may be shown to require time that is at most exponential in 
I and polynomial in m. 



7 Automatic Selection of /3 

As Proposition [5] indicates, the weight vector /3 and selection probabilities A play an important role in the 
statistical accuracy of the estimator through its asymptotic variance. The computational complexity, on the 
other hand, is determined by A independently of (3. Conceptually, we are interested in resolving the accuracy- 
complexity tradeoff jointly for both /3, A before estimating 9 by maximizing the scl function. However, since 
the computational complexity depends only on A we propose the following simplified problem: Select A 
based on available computational resources, and then given A, select the /3 (and 6) that will achieve optimal 
statistical accuracy. 

Selecting /? that minimizes the asymptotic variance is somewhat ambiguous as TET in Proposition [2] is 
an r X r positive scmidefinite matrix. A common solution is to consider the determinant as a one dimensional 
measure of the size of the variance matrisB, and minimize 

J(/3) =logdet(TI]T) =logdetS + 21ogdetT. (26) 

A major complication with selecting (3 based on the optimization of (|26p is that it depends on the true 
parameter value 6q which is not known at training time. This may be resolved, however, by noting that 
()26p is composed of covariance matrices under 9o which may be estimated using empirical covariances over 

*See |16l for a heuristic discussion motivating this measre. 
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the training set. To facilitate fast computation of the optimal j3 we also propose to replace the determinant 
in (|26p with the product of the digaonal elements. Such an approximation is motivated by Hadamard's 
inequality (which states that for symmetric matrices det(M) < Ma) and by Gersgorin's circle theorem 
(see below). This approximation works well in practice as we observe in the experiments section. We also 
note that the procedure described below involves only simple statisics that may be computed on the fly and 
does not contribute significant additional computation (nor do they require significant memory). 

More specifically, we denote if^'^) = Cov0„(VS'0o(A,, B;), V5eo(Aj, B^)) with entries K^^ , and approx- 
imate the log det terms in (l26l) using 

k r k 

log det T = - log det ^ l3j \j K^^^'> ~ -J2 J2 ^u^^ (^'^^ 

(fe \ fc fc 

^/3jA,V5eo(A,,B,) = logdet ^ ^ /3,A,/3, A,i^(*j) 
J = l / i=l J = l 

r k k 

We denote (assuming yl is a n x n matrix) for i e {1, . . . , n}, Ri{A) — ^-^^ D{Aii, Ri{A)) 

[Di where unambiguous) be the closed disc centered at An with radius Ri{A). Such a disc is called a 
Gersgorin disc. The result below states that for matrices that are close to diagonal, the eigenvalues are close 
to the diagonal elements making our approximation accurate. 

Theorem 1 (Gersgorin's circle theorem e.g., 0). Every eigenvalue of A lies within at least one of the 
Gersgorin discs D(Aii, Ri{A)). Furthermore, if the union of k discs is disjoint from the union of the remaining 
n — k discs, then the former union contains exactly fc and the latter n — k eigenvalues of A. 

The following algorithm solves for 0, /? jointly using alternating optimization. The second optimization 
problem with respect to (5 is done using the approximation above and may be computed without much 
additional computation. In practice we found that such an approach lead to a selection of j3 that is close to 
the optimal /3 (see Sec. 18.31 and Figures [TH EOl for results). 



Algorithm 1 Calculate 
Require: X, /3o, and 7 
1: i ^ 1 

3: while i < MAXITS do 
4: 6 ^ arg min sc£{X, A, /3) 
5: if converged then 
6; return 9 
7: else 

8: (3 ^ arg min J'(A, A, 9, 7) 

9: i -s— i + 1 
10: end if 
11: end while 
12: return false 



8 Experiments 

We demonstrate the asymptotic properties of 6'™''' and explore the complexity-accuracy tradeoff for three 
different models-Boltzmann machine, linear Boltzmann MRF and conditional random fields. In terms of 
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datasets, we consider synthetic data as well as datasets from sentiment prediction and text chunking domains. 

8.1 Toy Example: Boltzmann Machines 

We illustrate the improvement in asymptotic variance of the mscle associated with adding higher order 
likelihood components with increasing probabilities in context of the Boltzmann machine 

Pe{x) = exp - \ogij{e)j , xe{0, 1}'". (29) 

To be able to accurately compute the asymptotic variance we use m = 5 with 9 being a (2) dimensional 
vector with half the components +1 and half —1. Since the asymptotic variance of 9^^^ is a matrix we 
summarize its size using either its trace or determinant. 

Figure [5] displays the asymptotic variance, relative to the minimal variance of the mle, for the cases 
of full likelihood (FL), pseudo likelihood {\Aj\ = 1) PLi, stochastic combination of pseudo likelihood and 
2nd order pseudo likelihood {\Aj\ = 2) components aPL2 + (1 — q:)PLi, stochastic combination of 2nd 
order pseudo likelihood and 3rd order pseudo likelihood {\Aj\ — 3) components aPLa + (1 — Q!)PL2, and 
stochastic combination of 3rd order pseudo likelihood and 4th order pseudo likelihood {\Aj \ = 4) components 
aPU + (1 - a)PL3. 

The graph demonstrates the computation-accuracy tradeoff as follows: (a) pseudo likelihood is the fastest 
but also the least accurate, (b) full likelihood is the slowest but the most accurate, (c) adding higher order 
components reduces the asymptotic variance but also requires more computation, (d) the variance reduces 
with the increase in the selection probability a of the higher order component, and (e) adding 4th order 
components brings the variance very close the lower limit and with each successive improvement becoming 
smaller and smaller according to a law of diminishing returns. 

Figure [3] displays the asymptotic accuracy and complexity for different SCL policies for m — 9. We 
see how taking different linear combinations of pseudo likelihood orders spans a continuous spectrum of 
accuracy-complexity resolutions. The lower part of the diagram is the boundary of the achievable region 
(the optimal but unachievable place is the bottom left corner). SCL policies that lie to the right and top of 
that boundary may be improved by selecting a policy below and to the left of it. 

8.2 Local Sentiment Prediction 

Our first real world dataset experiment involves local sentiment prediction using a conditional MRF model. 
The dataset consisted of 249 movie review documents having an average of 30.5 sentences each with an 
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Figure 3: Computation-accuracy diagram for three SCL famiIies:Ai/3iPLl + A2(l — /3i)PL2, 
Ai/3iPLl + A2(l-/?i)PL3, Ai/3iPL2 + A2(l-/3i)PL3 (for multiple values of Ai,A2,/3i) for the 
Boltzmann machine with 9 binary nodes. The pure policies PLl and PL2 are indicated by 
black circles and the computational complexity of the full likelihood indicated by a dashed 
line (corresponding normalized asymptotic variance is 1). As the number of nodes increase the 
computational cost increase dramatically, in particular for the full likelihood and to a lesser 
extend for the pseudo likelihood policies. 




Figure 4: Graphical representation of a four token conditional random field (CRF). A, B 
are positive weight matrices and represent state-to-state transitions and state-to-observation 
outputs. Shading indicates the variable is conditioned upon while no shading indicates the 
variable is generated by the model. 

average of 12.3 words from a 12633 word vocabulary. Each sentence was manually labeled as one of five 
sentimental designations: very negative, negative, objective, positive, or very positive. As described in [15J 
(where more infomration may be found) we considered the task of predicting the local sentiment flow within 
these documents using regularized conditional random fields (CRFs) (see Figure [5] for a graphical diagram 
of the model in the case of four sentences) . 

Figure [S] shows the contour plots of train and test loglikelihood as a function of the scl parameters: 
weight /3 and selection probability A. The likelihood components were mixtures of full and pseudo {\Aj\ = 1) 
likelihood (rows 1,3) and pseudo and 2nd order pseudo {\Aj \ = 2) likelihood (rows 2,4). Aj identifies a set 
of labels corresponding to adjacent sentences over which the probabilistic query is evaluated. Results were 
averaged over 100 cross validation iterations with 50% train-test split. We used BFGS quasi-Newton method 
for maximizing the regularized scl functions. The figure demonstrates how the train loglikelihood increases 
with increasing the weight and selection probability of full likelihood in rows 1,3 and of 2nd order pseudo 
likelihood in rows 2,4. This increase in train loglikelihood is also correlated with an increase in computational 
complexity as higher order likelihood components require more computation. Note however, that the test 
set behavior in the third and fourth rows shows an improvement in prediction accuracy associated with 
decreasing the influence of full likelihood in favor of pseudo likelihood. The fact that this happens for weak 
regularization = 10 indicates that lower order pseudo likelihood has a regularization effect which improves 
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prediction accuracy when the model is not regularized enough. We have encountered this phenomenon in 
other experiments as well and we will discuss it further in the following subsections. 

Figure H] displays the complexity and negative loglikelihoods (left:train, right:test) of different scl esti- 
mators, sweeping through A and /3, as points in a two dimensional space. The shaded area near the origin 
is unachievable as no scl estimator can achieve high accuracy and low computation at the same time. The 
optimal location in this 2D plane is the curved boundary of the achievable region with the exact position on 
that boundary depending on the required solution of the computation-accuracy tradeoff. 

8.3 Text Chunking 

This experiment consists of using sequential MRFs to divide sentences into "text chunks," i.e., syntactically 
correlated sub-sequences, such as noun and verb phrases. Chunking is an crucial step towards full parsing. 
For exampl^, the sentence: 

He reckons the current account deficit will narrow to only # 1.8 billion in September. 

could be divided as: 

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion 
] [PP in ] [NP September ]. 

where NP, VP, and PP indicate noun phrase, verb phrase, and prepositional phrase. 

We used the publicly available CoNLL-2000 shared task dataset. It consists of labeled partitions of a 
subset of the Wall Street Journal (WSJ) corpus. Our training sets consisted of sampling 100 sentences 
without replacement from the the CoNLL-2000 training set (211,727 tokens from WSJ sections 15-18). The 
test set was the same as the CoNLL-2000 testing partition (47,377 tokens from WSJ section 20). Each of the 
possible 21,589 tokens, i.e., words, numbers, punctuation, etc., are tagged by one of 11 chunk types and an O 
label indicating the token is not part of any chunk. Chunk labels are prepended with flags indicating that the 
token begins (B-) or is inside {!-) the phrase. Figure [7] lists all labels and respective frequencies. In addition 
to labeled tokens, the dataset contains a part-of-speech (POS) column. These tags were automatically 
generated by the Brill tagger and must be incorporated into any model/feature set accordingly. 

In the following, we explore this task using various scl selection polices on two related, but fundamentally 
different sequential MRFs: Boltzmann chain MRFs and CRFs. 

8.3.1 Boltzmann Chain MRF 

Boltzmann chains are a generative MRF that are closely related to hidden Markov models (HMM). See 
[l4] for a discussion on the relationship between Boltzmann chain MRFs and HMMs. We consider SCL 
components of the form P {X2 , 12 1 i^i , ^3 ) , P {X2 , ^3 ,12,^3111,^4) which we refer to as first and second order 
pesudo likelihood (with higher order components generalizing in a straightforward manner). 

The nature of the Boltzmann chain constrains our feature set to only encode the particular token present 
at each position, or time index. In doing so we avoid having to model additional dependencies across 
time steps and dramatically reduce computational complexity. Although scl is precisely motivated by high 
treewidth graphs, we wish to include the full likelihood for demonstrative purposes-in practice, this is often 
not possible. Although POS tags are available we do not include them in these features since the dependence 
they share on neighboring tokens and other POS tags is unclear. For these reasons our time-sliced feature 
vector, Xi, has only a single-entry one and cardinality matching the size of the vocabulary (21,589 tokens). 

As is common practice, we curtail overfitting through a L2 regularizer, exp{— 253-||^||2}i which is is strong 
when (T^ is small and weak when is large. We consider tr^ a hyper-parameter and select it through cross- 
validation, unless noted otherwise. More often though, we show results for several representative cr^ to 
demonstrate the roles of A and (3 in O'^'^K 

Figures [10] and [TT] depict train and test negative log-likelihood, i.e., perplexity, for the scl estimator 
^100' with a pseudo/full likelihood selection policy (PLl/FL). As is our convention, weight f3 and selection 

^Taken from the CoNLL-2000 shared task site, |http : //www. cnts .ua. ac ■be/conll2000/chunklng/ [ 
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Figure 5: Train (left) and test (right) loglikelihood contours for maximum scl estimators for 
the CRF model. L2 regularization parameters are = 1 (rows 1,2) and = 10 (rows 3,4). 
Rows 1,3 are stochastic mixtures of full (Fy^and pseudo (PLl) loglikelihood components 
while rows 2,4 are PLl and 2nd order pseudo likelihood (PL2). 
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Figure 6: Scatter plot representing complexity and negative loglikelihood (left:train, 
riglit:test) of scl functions for CRFs with regularization parameter = 1/2. The points 
represent different stochastic combinations of full and pseudo likelihood components. The 
shaded region represents impossible accuracy /complexity demands. 




Figure 8: Graphical representation of a four token Boltzmann chain. A, B are positive weight 
matrices and represent preference in particular state-to-state transitions and state-to-feature 
emissions. Only the start state is conditioned upon while all others are generative. 
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Figure 9: Accuracy and complexity tradeoff for the Boltzmann chain MRF with PLl/FL 
(left) and PL1/PL2 (right) selection policies. Each point represents the negative loglikelihood 
(perplexity) and the number of flops required to evaluate the composite likelihood and its 
gradient under a particular instantiation of the selection policy. The shaded region represents 
empirically unobtainable combinations of computational complexity and accuracy. 



probability A correspond to the higher order component, in this case full likelihood. The lower order pseudo 
likelihood component is always selected and has weight 1 — /3. As expected the test set perplexity dominates 
the train-set perplexity. As was the situation in Sec. 18. 2[ we note that the lower order component serves to 
regularize the full- likelihood, as evident by the abnormally large tr^. 

We next demonstrate the effect of using a 1st order/2nd order pseudo likelihood selection policy (PL1/PL2). 
Recall, our notion of pseudo likelihood never entails conditioning on a;, although in principle it could. Figures 
[T^ and [13] show how the policy responds to varying both A and /3. Figure ^ depicts the empirical tradeoff 
between accuracy and complexity. Figure [T3] highlights the effectiveness of the /? heuristic. See captions for 
additional comments. 

8.3.2 CRFs 

Conditional random fields are the discriminative counterpart of Boltzmann chains (cf. Figures S] and [5]) . 
Since x is not jointly modeled with y, we are free to include features with non-independence across time 
steps without significantly increasing the computational complexity. Here our notion of pseudo likelihood 
is more traditional, e.g., P{Y2\Yi,Y,3, X2) and P{Y2,Y3\Yi,Y, 4, X2, X3) are vahd 1st and 2nd order pseudo 
likelihood components. 

We employ a subset of the features outlined in jT7] which proved competitive for the CoNLL-2000 shared 
task. Our feature vector was based on seven feature categories, resulting in a total of 273,571 binary features 
(i.e., /i(-^t) ~ '^)- The feature categories consisted of word unigrams, POS unigrams, word bigrams 
(forward and backward), and POS bigrams (forward and backward) as well as a stopword indicator (and its 
complement) as defined by [TD]. The set of possible feature/label pairs is much larger than our set-we use 
only those features supported by the CoNLL-2000 dataset, i.e., those which occur at least once. Thus we 
modeled 297,041 feature/label pairs and 847 transitions for a total of 297,888 parameters. As before, we use 
the 1/2 regularizer, exp{— 2^||0||2}, which is is stronger when tr^ is small and weak when is large. 

We demonstrate learning on two selection policies: pseudo/full likelihood (Figures [TBI and [T7)) and lst/2nd 
order pseudo likelihood (Figures \TE\ and [TO]) . In both selection polices we note a significant difference from 
the Boltzmann chain, (3 has less impact on both train and test perplexity. Intuitively, this seems reasonable 
as the component likelihood range and variance are constrained by the conditional nature of CRFs. Figure 
[T5] demonstrates the empirical accuracy /complexity tradeoff and Figure [20] depicts the effectiveness of the /3 
heuristic. See captions for further comments. 
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Figure 10: Train set (top) and test set (bottom) negative log-likelihood (perplexity) for 
the Boltzmann chain MRF with pseudo/fuU likelihood selection policy (PLl/FL). The x-axis, 
P, corresponds to relative weight placed on FL and and the y-axis, A, corresponds to the 
probability of selecting FL. PLl is selected with probability 1 and weight 1 — p. Contours 
and labels are fixed across columns. Results averaged over several cross-validation folds, i.e., 
resampling both the train set and the PLl/FL policy. Columns from left to right correspond 
to weaker regularization, cr^ = {500, 1000, 2500, 5000}. The best achievable test set perplexity 
is about 190. 



Unsurprisingly the test set perplexity dominates the train set perplexity at each (column) . 
For a desired level of accuracy (contour) there exists a computationally favorable regularizer. 
Hence ^""^ acts as both a regularizer and mechanism for controlling accuracy and complexity. 
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Figure 11: Train set and test set perplexities for the Boltzmann chain MRF with PLl/FL 
selection policy (see above layout description). The x-axis is again jS and the y-axis perplexity. 
Lighter shading indicates FL is selected with increasing frequency. Note that as the regularizer 
is weakened the the range in perplexity spanned by A increases and the lower bound decreases. 
This indicates that the approximating power of increases when unencumbered by the 
regularizer and highlights its secondary role as a regularizer. 
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Figure 12: Train set (top) and test set (bottom) perplexity for the Boltzmaim chain MRF 
with lst/2nd order pseudo hkelihood selection policy (PL1/PL2). The x-axis corresponds to 
PL2 weight and the y-axis the probability of its selection. PLl is selected with probability 1 
and weight 1-/3. Columns from left to right correspond to cr^ = {5000, 10000, 12500, 15000}. 
See Figure [To] for more details. The best achievable test set perplexity is about 189.5. 



In comparing these results to PLl/FL, we note that the test set contours exhibit less perplexity 
for larger areas. In particular, perplexity is lower at smaller A values, meaning a computational 
saving over PLl/FL at a given level of accuracy. 




Figure 13: Train (top) and test (bottom) perplexities for the Boltzmann chain MRF with 
PL1/PL2 selection policy (x-axis:PL2 weight, y-axis:perplexity; see above and previous). 



PL1/PL2 outperforms PLl/FL test perplexity at — 5000 and continues to show improve- 
ment with weaker regularizers. This is perhaps surprising since the previous policy includes 
FL as a special case, i.e., (A, /3) = (1,1)- We speculate that the regularizer's indirect con- 
nection to the training samples precludes it from preventing certain types of overfitting. See 
Sec. 18.41 for more discussion. 
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Figure 14: Demonstration of the effectiveness of the /3 heuristic, i.e., using S™"' as a plug- 
in estimate for 6^0 to periodically re-estimate /3 during gradient descent. Results are for the 
Boltzmann chain with PLl/FL (top) and PL1/PL2 (bottom) selection policies. The x-axis is 
the value at the heuristically found (3 and the y-axis the value at the optimal p. The optimal jS 
was found be evaluating over a j3 grid and choosing that with the smallest train set perplexity. 
The first column depicts the best performing /3 against the heuristic /3. The second and third 
columns depict the training and testing perplexities (resp.) at the best performing /3 and 
heuristically found /3. For all three columns, we assess the effectiveness of the heuristic by its 
nearness to the diagonal (dashed line). 



195 191 



For the PL1/PL2 policy the heuristic closely matched the optimal (all bottom row points 
are on diagonal). The heuristic out-performed the optimal on the test set and had slightly 
higher perplexity on the training set. It is a positive result, albeit somewhat surprising, and 
is attributable to either coarseness in the grid or improved generalization by accounting for 
variability in 9"^"^ 



8.4 Complexity/Regularization Win- Win 

It is interesting to contrast the test loglikelihood behavior in the case of mild and stronger L2 regularization. 
In the case of weaker or no regularization, the test loglikelihood shows different behavior than the train 
loglikelihood. Adding a lower order component such as pseudo likelihood acts as a regularizer that prevents 
overfitting. Thus, in cases that are prone to overfitting reducing higher order likelihood components improves 
both performance as well as complexity. This represents a win-win situation in contrast to the classical view 
where the mle has the lowest variance and adding lower order components reduces complexity but increases 
the variance. 

In Figure [5] we note this phenomenon when comparing ct^ = 1 to ti^ = 10 across the selection poli- 
cies PLl/FL and PL1/PL2. That is, the weaker regularization and more restrictive selection policy, i.e., 
PL1/PL2, is able to achieve comparable test set perplexity. 

For the text chunking experiments, we observe a striking win-win when using the Boltzmann chain MRF, 
Figures [TOl and [T2l Notice that as regularization is decreased (comparing from left to right), the contours 
are pulled closer to the x-axis. This means that we are achieving the same perplexity at reduced levels of 
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Figure 15: Accuracy and complexity tradeoff for the CRF with PLl/FL (left) and PL1/PL2 
(right) selection policies. Each point represents the negative loglikelihood (perplexity) and the 
number of flops required to evaluate the composite likelihood and its gradient under a partic- 
ular instance of the selection policy. The shaded region represents empirically unobtainable 
combinations of computational complexity and accuracy, ct^. 

computational complexity. The CRF however, only exhibits the win-win to a minor extent. We delve deeper 
into why this is might be the case in the following section. 

8.5 A, Interplay 

Throughout these experiments we fixed and either swept over (A, /3) or used the heuristic to evaluate 
(A, /3(A)). Motivated by the sometimes weak win-win (cf. Section we now consider how the optimal 
changes as a function of A. In Figure [21] we used the /3 heuristic to evaluate train and test perplexity over a 
(A, cr^) grid. We used CRFs and the text chunking task as outlined in Section fS. 3.21 

For the PLl/FL policy, we observe that for small enough A the optimal cr^, i.e., the with smallest test 
perplexity, has considerable range. At some point there are enough samples of the higher-order component 
to stabilize the choice of regularizer, noting that it is still weaker than the optimal full likelihood regularizer. 
Conversely, the PL1/PL2 regularizer has an essentially constant optimal regularizer which is relatively much 
weaker. 

As a result, we believe that the lack of win-win for the chunking CRF follows from two effects. In the 
case of the PLl/FL policy the contour plots are misleading since there is no single that performs well 
across all A G [0, 1]. For the PL1/PL2 there is simply little change in regularization necessary across A. 

9 Discussion 

The proposed estimator family facilitates computationally efficient estimation in complex graphical models. 
In particular, different A) parameterizations of the stochastic composite likelihood enables the resolution 
of the complexity-accuracy tradeoff in a domain and problem specific manner. The framework is generally 
suited for Markov random fields, including conditional graphical models and is theoretically motivated. When 
the model is prone to overfit, stochastically mixing lower order components with higher order ones acts as 
a regularizer and results in a win-win situation of improving test-set accuracy and reducing computational 
complexity at the same time. 
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Figure 16: Train set (top) and test set (bottom) perplexity for the CRF with pseudo/full 
Ukelihood selection policy (PLl/FL). The x-axis corresponds to FL weight and the y-axis the 
probability of its selection. PLl is selected with probability 1 and weight 1 — /?. Columns from 
left to right correspond to cr^ = {5000, 10000, 12500, 15000}. See Figure [Tol for more details. 
The best achievable test set perplexity is about 5.5. 



Although we cannot directly compare CRFs to its generative counterpart, we observe some 
strikingly different trends. It is immediately clear that the CRF is less sensitive to the rel- 
ative weighting of components than is the Boltzmann chain. This is partially attributable 
to a smaller range of the objective-the CRF is already conditional hence the per-component 
perplexity range is reduced. 
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Figure 17: Train (top) and test (bottom) perplexities for a CRF with PLl/FL selection 
policy (x-axisFL weight, y-axis: perplexity; see above and Fig. Ill[). 

Perhaps more evidently here than above, we note that the significance of a particular /3 is less 
than that of the Boltzmann chain. However, for large enough cr^, the optimal /3 7^ 1. This 
indicates the dual role of PLl as a regularizer. Moreover, the left panel calls attention to the 
interplay between /3, A, and . See Sec. 18.51 for more discussion. 
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policy (x-axis:PL2 weight, y-axis:perplexity; see above and Fig. Ilip . 

Although increasing A only brings minor improvement to both the training and testing per- 
plexities, it is worth noting that the test perplexity meets that of the PLl/FL. Still though, 
the overall lack of resolution here suggests that smaller values of A would better span a range 
of perplexities and at reduced computational cost. 



[2] J. Bcsag. Spatial interaction and the statistical analysis of lattice systems (with discussion). J Roy 
Statist Soc B, 36(2):192-236, 1974. 

[3] Y. Bishop, S. Fienberg, and P. Holland. Discrete multivariate analysis: theory and practice. MIT press. 



23 



0.2 0.4 0.6 0.8 1 2 4 6 8 10 6 8 10 12 

approx ]i at approx p at approx |5 

Figure 20: Demonstration of the effectiveness of the /3 heuristic. Results are for the CRF 
with PLl/FL (top) and PL1/PL2 (bottom) selection policies. The x-axis is the value at the 
heuristically found /3 and the y-axis the value at the optimal /3. The first column depicts the 
best performing /3 against the heuristic /3. The second and third columns depict the training 
and testing perplexities (resp.) at the best performing /3 and heuristically found /3. For all 
three columns, we assess the effectiveness of the heuristic by its nearness to the diagonal 
(dashed line). See Fig. [T?] for more details. 

The optimal and heuristic /3 match train and test perplexities for both policies. The actual /3 
value however does not seem to match as well as the Boltzmann chain. However, if we note 
the flatness of the /? grid (cf. Fig. ll7l and ll9|l this result is unsurprising and can be disregarded 
as an indication of the heuristic's performance. 

1975. 

[4] R. Casella and C. Robert. Monte Carlo Statistical Methods. Springer Verlag, second edition, 2004. 

[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, second edition, 
2005. 

[6] T. S. Ferguson. A Course in Large Sample Theory. Chapman & Hall, 1996. 

[7] N. Hjort and C. Varin. ML, PL, and QL in markov chain models. Scand J Stat, 35(l):64-82, 2008. 
[8] R. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990. 

[9] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods 
for graphical models. Machine Learning, 37(2):183-233, 1999. 

[10] D. Lewis, Y. Yang, T. Rose, and F. Li. RCVl: A new benchmark collection for text categorization 
research. Journal of Machine Learning Research, 5:361-397, 2004. 

[11] G. Liang and B. Yu. Maximum pseudo likelihood estimation in network tomography. IEEE T Signal 
Proces, 51(8):2043-2053, 2003. 



24 




0.2 0.4 0.6 



Figure 21: Optimal regularization parameter as a function of (A, /3(A)) for PLl/FL (left) 
and PL1/PL2 (right) CRF selection policies. In the left figure, PLl/FL, A represents the 
probability of including FL into the objective. A few FL samples add uncertainty to the 
objective thus a weaker regularizer is preferable. As more FL samples are incorporated, this 
effect diminishes but still acts to regularize since the full likelihood (only) best regularization 
is cr^ — 500 (red triangle). The right figure, PL1/PL2, exhibits only a minor change as A (the 
probability of incorporating PL2) is increased. It is however, best served by a much weaker 
regularizer than PL2 alone (red triangle). 



[12] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood 
estimators. In Proc. of the International Conference on Machine Learning, 2008. 

[13] B. G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80:221-239, 1988. 

[14] D. J. C. MacKay. Equivalence of linear boltzmann chains and hidden markov models. Neural Compu- 
tation, 8(1):178-181, 1996. 

[15] Y. Mao and G. Lebanon. Isotonic conditional random fields and local sentiment flow. In Advances in 
Neural Information Processing Systems 19, pages 961-968, 2007. 

[16] R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley, 1980. 

[17] F. Sha and F. Pereira. Shallow parsing with conditional random fields. Proceedings of HLT-NAACL, 
pages 213-220, 2003. 

[18] C. Sutton and A. McCallum. Piecewise pseudolikelihood for efficient training of conditional random 
fields. In Proc. of the International Conference on Machine Learning, 2007. 

[19] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998. 

[20] C. Varin and P. Vidoni. A note on composite likelihood inference and model selection. Biometrika, 
92:519-528, 2005. 

[21] E. P. Xing, M. I. Jordan, and S. Russell. A generalized mean field algorithm for variational inference 
in exponential families. In Proc. of Uncertainty in Artificial Intelligence, 2003. 

[22] S.-C. Zhu and X. Liu. Learning in Gibbsian fields: How accurate and how fast can it be? IEEE T 
Pattern Anal, 24(7):1001-1006, 2002. 



25 



A Proofs 

The proofs below generalize the classical consistency and asymptotic efficiency of the mle [6] and the cor- 
responding results for m-estimators [12]. They follow similar lines as the proofs in |B] and [TO], with the 
necessary modifications due to the stochasticity of the scl function. We assume below that pe{X) > and 
that X is a discrete and finite RV. 

The following lemma generalizes Shannon's inequality 1,5^ for the KL divergence. We will use it to prove 
consistency of the SCL estimator. 

Lemma 2. Let {Ai,Bi), . . . , {Ak, Bk) be a sequence of m-pairs that ensures identifiability oi pe,d £ 6 and 
ai, . . . , afc positive constants. Then 

fe 

Y,akD{pe{XA,\XB,)\\p9'{XA,\XB,)) > (30) 
i=i 

where equality holds iff 6* = 6*'. 

Proof The inequality follows from applying Jensen's inequality for each conditional KL divergence 

-D{pg {Xa^ Xb-) Pe' {Xa, \Xb,))^^ pe log |y . < log Ep^ \ \ 

Pe{XA,\XB^) pe{XA,\XB^) 

log 1 = 0. 

For equality to hold we need each term to be which follows only if pq^Xa^ \XBj ) = Pe' {Xa, \XBj ) for all j 
which, assuming identifiability, holds iff 6* = 6*'. ■ 



Proposition 1. Let 8 C M'' be an open set, pe{x) > and continuous and smooth in 6, and (Ai, Bi), . . . , {Ak, Bk) 
be a sequence of m-pairs for which {{Aj, Bj) : Vj such that Xj > 0} ensures identifiability. Then the sequence 
of SCL maximizcrs is strongly consistent i.e.. 



P ( lim 9n = eo) = I. (31) 



Proof The scl function, modified slightly by a linear combination with a term that is constant in 6 is 

n k 

^^'(^) = {z^,\ogpe{X^2]\X^]) - A,logp,„(X«|xW) 

By the strong law of large numbers, the above expression converges as n — oo to its expectation 

k 

m ^-Y^P, A, D{pe, {Xa, \XB,)\\pe {Xa, \Xb,)). 
If we restrict ourselves to the compact set S — {9 : ci < 116* — 6*0 1| < 02} then 



sup sup I V ZjPj log Pe [Xa^ \Xb, ) - Aj/?^ logpeo (^a, \Xb,) < K{x) < 00 (32) 
9e5 z 1^ 



where K{x) is a function satisfying EK{X) < 00. As a result, the conditions for the uniform strong law of 
large numbers ^ hold on S leading to 



P\ lim sup\sd'{e)~ fi{e)\=o\ = 1. 



(33) 
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By Proposition O ^{9) is non-positive and is zero iff = O^. Since the function y,{9) is continuous it 
attains its negative supremum on the compact S: supg^g ^{9) < 0. Combining this fact with (|33p we have 
that there exists N such that for all n > iV the scl maximizers on S achieves strictly negative values of 
scl' {9) with probability 1. However, since sci'{9) can be made to achieve values arbitrarily close to zero un- 
der 9 — 9a, we have that 9™^^ ^ S ioi n> N . Since ci, C2 were chosen arbitrarily 9™^"^ — > 9^ with probability 
1. ■ 



Proposition 2. Making the assumptions of Proposition [T] as well as convexity of C M'' we have the 
following convergence in distribution 

V^(C'' -9o)'^N (0, TET) (34) 

where 



T-i = ^/3,A,Vare„(V5e„(A„i?,)) (35) 

E = Var,„ f5,\,VSgMi,B,)^ . (36) 

The notation y^if g^iY) represents the covariance matrix of the random vector Y under pg^ while the 
notations ^ , ^ in the proof below denote convergences in probability and in distribution [5] . 
Proof By the mean value theorem and convexity of 8 there exists rj e (0, 1) for which 9' = 6*0 +'ri{9™^^ — 9q) 
and 

Vscini9r') = VscC(^o) + V2sc£„(0')(C'' - ^o) 
where V f{9) and V'^ f{9) are the r x 1 gradient vector and r x r matrix of second order derivatives of f{9). 
Since 0„ maximizes the scl, Vsc4i(^™^') = and 

V^(C'' - 9o) = -V^(W^scen{9')r^Vsc£M. (37) 

By Proposition [T] we have 9^^^ A 9o which implies that 9' A 6*0 as well. Furthermore, by the law of large 
numbers and the fact that if Wn A W then g{Wn) A g{W) for continuous g, 

{V'sce„{9'))-' A iV^sclM)-' (38) 
^ k 

k 

For the remaining term in (j37p we have 

k 1 

j=l ^ i=\ 

where the random vectors = V logpe(xj^'^ |xj|^) have expectation and variance matrix VareQ(Wij) 
AjVareQ(VS'0j,(ylj, Bj)). By the central limit theorem 

1 " 

^-Y^W^.^N (0, A, Var (V^^o {A, , B,))) . 

^ ■ 1 



27 



The sum ^JnV scin{6o) — ftV" n ^"=1 asymptotically Gaussian as well with mean zero since 

it converges to a sum of Gaussian distributions with mean zero. Since in the general case the random 
variables -/n i X]"=i i = ^-^ ■ ■ ■ correlated, the asymptotic variance matrix of -^n Vsc£„(6'o) needs 

to account for cross covariance terms leading to 

V^Vsc£„((?o) ^ N |^0,Var9„ {^^p,\jVSs,{Aj,B,)^ . (39) 

We finish the proof by combining (P7)) . ([55]) and pop using Slutsky's theorem. ■ 

Recall our notation for the case that the true model P ^ {pe : 6 ^ 6}. 

M^,Z)^^VmgiX,Z) (40) 

tpe{X, Z) = V'^mg{X, Z) (matrix of second order derivatives) (41) 
1 " 

^^^{0) = - T^Mx^'Kz^'^)- (42) 

Proposition 3. Assuming the conditions in Proposition [T] as well as supg.||g_g|^||>£ Af(0) < A/(6'o) for all 
e > we have ^Jf''' — ^ 0o as n ^ oo with probability 1. 

Proof We assert 



P \ lim sup \sd\Q) - ii{Q)\ =0^ = 1. (43) 

on the compact set 5 = {0 : ci < ||0 — 6'o|| < ci) as in the proof of Proposition [TJ We proceed similarly along 
the lines of Proposition [1] with the necessary modification due to the fact that the true model is outside the 
parametric family. 

Since the function [i(ff) is continuous it attains its negative supremum on the compact S: supg^^ /^(^) < 
A*(^o) > 0. Combining this fact with (|43| we have that there exists N such that for all n > the scl 
maximizers on S achieves strictly negative values of set' (ff) with probability 1. 

However, since sct'(0) can be made to achieve values arbitrarily close to /i(6'o) as 6'„ — > 6*0, we have that 
^ 5* for n > iV. Since ci, C2 were chosen arbitrarily — >• 6*0 with probability 1. ■ 



Proposition 4. Assuming the conditions of Proposition [5] as well as E p(jf)E pj^) ||-!/;0o (A, Z)|p < oo, 
Ep(x)Ep(2)i/)0g(A) exists and is non-singular, |^y | — \d'^xl}g{x) / d9i9j\ < g{x) for all i,j and in a neigh- 
borhood of 0Q for some integrable g, we have 



MOn - Oo) = -(E p^^^Ep^z^i;^^)-^ J2 V'eo(AW, ZW) + op(l) (44) 
or equivalently 

e„^eo- iEp^x)^p^z)i^e,)-'^f2i;eo{X^'\Z^'^) + op (^-^) . (45) 

Proof By Taylor's theorem there exists a random vector 6„ on the line segment between 9o and d„ for 
which 

= *„(4) = ^^nido) + *„(eo)(^n - ^O) + l{0n " ^o)^*«(^„)(4 " ^o)- 
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which we re-arrange as 

V^*«(^o)(4 - ^o) + V^\{en - OoY^Mik - ^o) = -V^^nien) (46) 

= -V^*„(eo)+op(l) (47) 

where the second equality follows from the fact that A 6'o and continuous functions preserves converges 
in probability. 

Since 4'„(0o) converges by the law of large numbers to E p^^x)^ P{Z)''Pe{X, Z) and 5'„(0„) converges to a 
matrix of bounded values in the neighborhood of (for large n) , the Ihs of ()46p is 

[£p(x)^P(z)MX,Z) + op(l) + ]^{en - 0o)Op(l)) {k - Oo) 

= V^iEp^x)^P^z)MX,Z) + op{l)){0r,-9o) (48) 

since 6'„ — 6*0 = op(l) and op(l)Op(l) = op(l) (the notation Op(l) denotes stochastically bounded and it 
applies to 5'„(6'„) as described above). Putting it together we have 

V^(Ep(x)Ep(z)^e(X,^)+op(l))(^„-0o) = -\A^*n(^?o) + op(l). 

Since the matrix E p(x)^ p{z)'4'si-X, Z) +op(l) converges to a non-singular matrix, multiplying the equation 
above by its inverse finishes the proof. ■ 



Corollary 1. Assuming the conditions specified in Proposition |4] we have 

v^(4 -fiio) ^(O,(Ep(x)Ep(;j)7/'0j-i(Ep(x)Ep(2)Veo^Jo)(Ep(x)Ep(z)^/'0o)"^)- (49) 

Proof Equation ([2^ follows from (I^Hl) by noticing that due to the central limit theorem 5'„(6'o) (as it is an 
average of n iid RVs with expectation 0) 

n 

• - ^ Veo (^^'\ ^^'^) - A^(0, E p(x) E p(^) VeoO- 

i=l 

Substituting this in the right hand side of (1201) and accounting for the modified variance due to the matrix 
inverse results in 
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